### Challenges on Designing Parallel Processing for Realizing Real-Time Applications

July 12th, 2019



Architect, President & CEO Yukoh Matsumoto, Ph.D.

**TOPS Systems Corp.** 





## Agenda

- Development Services / TOPS Systems
  - Efficient Parallel Processing Software and Hardware
- Demand of Parallel Processing in Real-Time Systems
  - ML, NW, 5G, Brake System, Free 3D View, Robotics Control
- Challenges on Designing Parallel Processing
  - Free Lunch is over in 2005, but ...
  - Many Performance Black Boxes in SW and HW
- Methodologies : Designing Time and Parallelism
  - Models of Processing
  - White Box Performance based Parallel Processing Design
- Architecture : Removing Inter-Core Overhead
  - Communication and Synchronization
- Solutions





## My principle of Life

When we do our best, Ideas come up.

When we work half-heartedly, **Complaints** come up.

When we are lazy, **Excuses** come up.

Takeda Shingen (1521-1573)

In Japanese,

- 一生懸命だと、智慧がでる
- 中途半端だと、愚痴がでる
- いい加減だと、言い訳がでる





When we do our best, Ideas come up!





## **Vision & Technologies**

Parallel Processing is a key at All Layers!



MPSoC'19



# **Development Services / TOPS Systems**



**Today's Talk : Ideas from experiences with customers** 





#### Market Trend toward Parallel Processing



Deep learning chipsets by type. Source: Tractica

More than 90% of Hardware Platforms are for Parallel Processing





**Examples : Next-Generation Automotive** 



#### **TSN/In Vehicle Network**

Efficient Parallel Processing is a must for high Performance and Energy-Efficiency





**Examples of Parallelization Requirements from Constraints for Edge** 

| Category    | Item              | Metric               | Unit                  | FFT | DNN | 10GbE | Security | ISP |
|-------------|-------------------|----------------------|-----------------------|-----|-----|-------|----------|-----|
| Performance | Computing         | Floating             | TFLOPS                | 1 | n/a | n/a | n/a | n/a |
|             |                   | Integer              | TOPS                  | n/a | 1~10 | 1~ | 0.01 | 1~ |
|             | Memory            | Bandwidth            | GB/s                  | 32 | 1~2 | 30 | 1.2 | 4~ |
|             | Power Consumption |                      | W                     | ~10 | 1~1.5 | ~5 | 0.1 | ~1 |
|             | Chip Size         | Area @ 28nm          | mm<sup>2</sup>        | 100 | 50 | 100 | 10 | 50 |
|             | Cost              | Device               | \$                    | 50 | 50 | 50 | 1 | 3 |
| Efficiency  | Hardware          | Energy Efficiency    | TFLOPS/W              | 0.1 | n/a | n/a | n/a | n/a |
|             |                   |                      | TOPS/W                | n/a | 1 | ~0.2 | 0.01 | 1 |
|             |                   | Area Efficiency      | TFLOPS/mm<sup>2</sup> | 0.01 | n/a | n/a | n/a | n/a |
|             |                   |                      | TOPS/mm<sup>2</sup>   | n/a | 0.02~0.2 | 0.01~ | 0.001 | 0.02 |
|             |                   | Cost Efficiency      | TFLOPS/\$             | 0.02 | n/a | n/a | n/a | n/a |
|             |                   |                      | TOPS/\$               | n/a | 0.02~0.2 | 0.02~0.2 | 0.01 | 0.33 |
| Other       | Scalability       | Maximum Frequency    | MHz                   | 300 | 300 | 300 | 300 | 300 |
|             |                   | Parallel Processing  | Number of Parallels   | 3k | 3k~30k | 3k | 16 | 3k pixel |
|             | Note              | Design Consideration |                       | 1 MFFT/s<br>~4096 pnt<br>Single FP<br>16 Layer<br>Parallel FFT | 30 fps<br>Full-HD<br>Object Rec<br>GoogLeNet<br>AlexNet<br>Squeeze<br>ResNet |  |  | 7 Mpixel<br>120 fps<br>Focus<br>White Balance<br>Color Conv. |

**3k-30k** Parallel Processing is required to meet Performance and Efficiency





Edge Computing on Cyber Physical System (CPS)



Efficient Parallel Processing on Edge Devices for High-Throughput and Low-Latency





#### Real-Time System vs. General Purpose System



Processing in a Real-Time System is a type of Stream Processing based on Dataflows





### Challenges on Designing Parallel Processing

What the industry is currently facing

- **#1. Free Lunch is over**, but the industry still sticks with sequential programming
- **#2. Almost No Effort on Software Design**
- **#3. Lack of Expression of Timing and Parallelism in Design tools**
- **#4. Many Performance Black Boxes in Software and Hardware stack**
- **#5. Limited Scalability in Hardware Platform**

It is very hard to utilize the potential performance of CPUs, GPUs, and FPGAs





# Challenges on Designing Parallel Processing

#### Free Lunch is over in 2005, but ...



Prepared by C. Batten, School of Electrical and Computer Engineering, Cornell University, 2005. Retrieved Dec 12, 2012, from http://www.csl.cornell.edu/courses/ece5950/handouts/ece5950-overview.pdf

Conventional Software may become slower

#### **#1.** Free Lunch is over, but the industry still sticks with sequential programming











# Challenges on Designing Parallel Processing

Many Performance Black Boxes in Design Methodologies




# Challenges on Designing Parallel Processing

**Performance Black Boxes = Unpredictable/Uncontrollable Execution Time** 

### *Execution Time = Number of Instructions × CPI × Cycle Time*

### Software

- Number of Instructions
  - **Compiler Optimization**
  - Interrupt
  - Scheduling/OS

### Hardware

- CPI
  - Instruction Cache Miss
  - Data Cache Miss
  - Branch Prediction Miss
  - Super Scalar Scheduling
  - Inter-Core Synchronization
- Cycle Time
  - Dynamic Frequency Scaling
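As a minimal sketch of the execution-time formula above, the following Python snippet shows how one hardware black box, a data cache miss, shifts CPI and therefore execution time. The function names, the miss rate, and the miss penalty are hypothetical values chosen only for illustration, and the stall model (one memory access per instruction) is a simplifying assumption.

```python
def effective_cpi(base_cpi, miss_rate, miss_penalty_cycles):
    """CPI including cache-miss stalls (assumes one memory access per instruction)."""
    return base_cpi + miss_rate * miss_penalty_cycles


def execution_time_s(num_instructions, cpi, cycle_time_s):
    """Execution time = number of instructions x CPI x cycle time."""
    return num_instructions * cpi * cycle_time_s


# Hypothetical workload: 1 million instructions on a 1 GHz core (1 ns cycle).
N = 1_000_000
CYCLE = 1e-9

best = execution_time_s(N, effective_cpi(1.0, 0.00, 100), CYCLE)
worst = execution_time_s(N, effective_cpi(1.0, 0.05, 100), CYCLE)
print(f"no misses: {best * 1e3:.1f} ms, 5% misses: {worst * 1e3:.1f} ms")
```

The same program takes several times longer once misses appear, although its instruction count and clock are unchanged. This is the sense in which cache behavior acts as a performance black box.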

#### **#4. Many Performance Black Boxes in Hardware stack**







### How to approach the challenges





#### **Utilizes all types of Parallelism inherent in Applications**



Dataflow allows easy extraction and optimization of spatial parallel processing





#### **Basic Model of Processing**

### **Conventional Expression**

Flow Chart, Sequence Diagram



- Issue : Implicit Performance information
  - No Data Size
  - No Data Dependency
  - No Timing

### Execution Time = function(input, parameter, state)





- Input: Byte
- Output: Byte
- Parameter: Byte
- State: Byte
- Logic: Number of Inst.
- Execution Time: Cycles
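One way to make this performance information explicit is to attach it to each processing node. The sketch below is an illustration only: the `Process` class, its field names, and the FIR-like cost function are hypothetical and are not taken from the TOPS Systems tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Process:
    """One processing node with its performance information made explicit."""
    name: str
    input_bytes: int
    output_bytes: int
    parameter_bytes: int
    state_bytes: int
    # Execution time in cycles as a function of (input, parameter, state) sizes.
    cycles: Callable[[int, int, int], int]

    def execution_cycles(self) -> int:
        return self.cycles(self.input_bytes, self.parameter_bytes, self.state_bytes)


# Hypothetical FIR-like filter: cost grows with input size times tap count.
fir = Process(
    name="fir",
    input_bytes=1024,
    output_bytes=1024,
    parameter_bytes=16,  # assumed: 16 one-byte taps
    state_bytes=16,
    cycles=lambda inp, par, st: inp * par + st,
)
print(fir.name, fir.execution_cycles(), "cycles")
```

With data sizes and a cycle function on every node, timing can be estimated at design time rather than discovered after implementation.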





Model of Parallel Processing : DeepPN

- Issues : Current Programming limits Parallel Processing
- Objectives: Define how to express parallel processing explicitly
  - Easy to express parallelism inherent in Real-Time Applications
  - Easy to speed up Real-Time applications by optimization of
    - Efficient Parallel Processing
    - Less Memory Access
    - Simple Communication
    - Simple Synchronization
    - Deterministic
    - Stream Processing
- DeepPN (Deep Process Network)



**GLOBAL PARAMETERS** 

#### DeepPN : Dataflow Graph, Hierarchy, Global Variable





**Model of Parallel Execution Time** 



#### Parallel Execution Time

- ① Internal Processing: ALU, Memory (LD, ST), Branch
- ② Inter-Core Conflict: Shared Memory Access
- ③ Inter-Core Communication: Data Transmit and Receive
- ④ Inter-Core Synchronization: Wait for Lock, Barrier

For faster processing, minimize each by design optimization
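The four components can be written down as a small cost model. This is a sketch under stated assumptions: the `CoreTime` class and all cycle counts are hypothetical, and the rule that a parallel region ends when its slowest core ends is an assumption of this example rather than something stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class CoreTime:
    """Cycle counts for one core, split into the four components listed above."""
    internal: int         # ALU, memory (LD/ST), branch
    conflict: int         # shared memory access contention
    communication: int    # data transmit and receive
    synchronization: int  # waiting for locks and barriers

    def total(self) -> int:
        return (self.internal + self.conflict
                + self.communication + self.synchronization)


def parallel_execution_cycles(cores):
    """Assumption: the parallel region finishes when the slowest core finishes."""
    return max(core.total() for core in cores)


# Hypothetical numbers for two cores.
cores = [CoreTime(1000, 50, 120, 80), CoreTime(900, 70, 120, 200)]
print(parallel_execution_cycles(cores), "cycles")
```

In this example the second core does less internal work but still sets the overall time, because its synchronization wait dominates.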






#### For Predictable/Controllable Processing Execution Time



**Control and Predict Dynamic Behavior** 



**Modeling & Simulation of Parallel Processing Performance** 




#### **Guidelines and Tools**




#### **Decrease Computing Time and Increase Energy-Efficiency**



**Increase Energy-Efficiency** 

$\text{Performance}\uparrow = \text{OPC}\uparrow \times f\downarrow$

$\text{Power}\downarrow = \frac{1}{2} \alpha C V^2 f\downarrow$
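Dividing the performance relation by the power relation cancels $f$, so energy efficiency depends on OPC and on $\frac{1}{2} \alpha C V^2$ rather than on clock rate. The sketch below compares two hypothetical design points with equal performance. Every number is an illustrative assumption, including the tenfold larger switched capacitance of the wide design and the lower supply voltage that a lower clock rate is assumed to permit.

```python
def performance(opc, f_hz):
    """Performance = OPC x f (operations per second)."""
    return opc * f_hz


def dynamic_power_w(alpha, c_farad, v_volt, f_hz):
    """Power = 1/2 x alpha x C x V^2 x f."""
    return 0.5 * alpha * c_farad * v_volt ** 2 * f_hz


# Hypothetical design points (all values are assumptions for illustration).
narrow_fast = {"opc": 4, "f_hz": 2.0e9, "alpha": 0.2, "c_farad": 1e-9, "v_volt": 1.0}
wide_slow = {"opc": 80, "f_hz": 100e6, "alpha": 0.2, "c_farad": 10e-9, "v_volt": 0.7}

for name, d in (("narrow/fast", narrow_fast), ("wide/slow", wide_slow)):
    perf = performance(d["opc"], d["f_hz"])
    power = dynamic_power_w(d["alpha"], d["c_farad"], d["v_volt"], d["f_hz"])
    print(f"{name}: {perf / 1e9:.1f} GOPS, {power:.3f} W, "
          f"{perf / power / 1e9:.1f} GOPS/W")
```

Under these assumptions both designs deliver the same operations per second, while the low-clock design draws less dynamic power.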

| Theory for Decreasing Computing Time | Breakdown         | Traditional Approaches  | SMYLEdeep's Approaches     |
|--------------------------------------|-------------------|-------------------------|----------------------------|
| Less Cycle Time                      | Clock Frequency   | High Clock Rate (GHz)   | Low Clock Rate (100 MHz)   |
| Less Number of Instructions          | ALU               | SIMD (8-32 packed)      | Special (~768 operations)  |
|                                      | Load              | Single, Block           | Less, Stream (No Latency)  |
|                                      | Store             | Single, Block           | Less, Stream (No Latency)  |
|                                      | Branch            | Bcc, Jmp, Call/Ret      | Bcc, Jmp, Call/Ret         |
|                                      | Communication     | Store & Load            | None                       |
|                                      | Synchronization   | Lock                    | None (Prefix)              |
| Less CPI (Clocks per Instruction)    | Cache             | Instruction/Data        | Instruction/Large RegFile  |
|                                      | Super Scalar      | ~4-way                  | None                       |
|                                      | Branch Prediction | Yes (due to deep pipe)  | No (shallow pipeline)      |

**Tightly Coupled Multi-ISA cores optimum for Efficient Parallel Processing** 


#### **Common Issue on Parallel Processing**



**Goal : Increase Parallel Portion and Minimize Overhead**
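To see why both halves of this goal matter, the sketch below uses an Amdahl-style speedup formula with an added overhead term. Amdahl's law is not named on the slide, so treating it as the underlying model is an assumption, as are the function name and the sample values.

```python
def speedup(parallel_portion, cores, overhead=0.0):
    """Amdahl-style speedup; overhead is a fraction of the single-core run time."""
    serial = 1.0 - parallel_portion
    return 1.0 / (serial + parallel_portion / cores + overhead)


# Hypothetical cases on 16 cores: (parallel portion, overhead fraction).
for p, ovh in [(0.90, 0.05), (0.99, 0.05), (0.99, 0.00)]:
    print(f"parallel={p:.2f} overhead={ovh:.2f} "
          f"-> speedup {speedup(p, 16, ovh):.1f}x")
```

Raising the parallel portion helps, but a fixed communication and synchronization overhead still caps the gain until it is removed as well.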





#### Zero-Overhead Message Passing (ZOMP)

Traditional Parallel Processing



Remove Overhead due to Inter-Core Communication and Synchronization





No Store-Load for communication and No Lock required for synchronization



#### ZOMP

**Communication by RF sharing, Synchronization through Event bus**



#### Zero Cycle required

\*Only a prefix instruction (SYNC<sup>®</sup>) is required for inter-core communication and synchronization

Two Programs can communicate and synchronize with Zero Cycle!





You can scale Multicore performance linearly with the number of cores



## Architecture : SMYLE<sub>Deep</sub>

**Tightly Coupled Cores for efficient parallel processing** 






Parallel Processing on SMYLEdeep

- Task Parallel
  - Each core executes different tasks independently with MIMD
- Data Parallel
  - Each core executes the same code as SIMD
- Pipeline Parallel
  - Each core executes different tasks independently with MIMD and passes its output to the next core through the R-bus with zero cycles
- Mixture of Task, Data, Pipeline processing
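The pipeline-parallel pattern can be sketched in software with one thread per stage and FIFO queues between stages. This only illustrates the structure: the queues stand in for the R-bus and do not model zero-cycle transfer, the helper names are hypothetical, and Python threads show the dataflow rather than a real speed-up.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Run one pipeline stage: apply func to each item and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Connect one thread per stage with FIFO queues and stream items through."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Two hypothetical stages: add one, then double.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5)))
```

Each stage runs different code and only ever sees its own input stream, which is the property that lets a dataflow description map directly onto cores.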





### **Types of Hardware Architecture**

Utilize the Advantage of Configurability in Time and Space for Parallel Processing



SMYLEdeep: Tightly Coupled Parallel Processing Architecture





### **Solutions**

#### **#1** Top down design with Time and Parallelism

Avoid troublesome exploration of parallelism at implementation

#### **#2 Intensive Software Design** prior to programming

- Migrate to explicit Processing Model
- Provide Parallel Software Design Guideline and Software Tools

#### **#3 Structural coding of Parallelism**

Shifting to modules according to Dataflow

#### #4 Design Optimization based on White Box Performance

- Quantitative Performance Analysis and Estimation
- Remove Unpredictability / Uncontrollability

#### **#5 Scalable and Deterministic Hardware Platform**

Provide SMYLE<sub>RT</sub> for optimum and efficient real-time processing

#### We provide services and products for "Highly Efficient Parallel Processing"





## **Toward Efficient Parallel Processing**

## Changes

- from Programming to Software Design
- from Shared Memory to Direct Inter-Core Communication





### Looking for Academic and Industrial Partners to provide solutions for WW MPSoC users!

**Go Next Generation** 



yukoh@topscom.co.jp



